Data Preparation for Machine Learning

Data preparation is a vital step in the machine learning pipeline. Just as visualization is necessary to understand the relationships in data, proper preparation, or data munging, is required to ensure that machine learning models work optimally.

The process of data preparation is highly interactive and iterative. A typical process includes at least the following steps:

  1. Visualization of the dataset to understand the relationships and identify possible problems with the data.
  2. Data cleaning and transformation to address the problems identified. In many cases, step 1 is then repeated to verify that the cleaning and transformation had the desired effect.
  3. Construction and evaluation of a machine learning model. Visualization of the results will often reveal further data preparation that is required, sending you back to step 1.

In this lab you will learn the following:

  • Recode character strings to eliminate characters that will not be processed correctly.
  • Find and treat missing values.
  • Set correct data type of each column.
  • Transform categorical features to create categories with more cases and coding likely to be useful in predicting the label.
  • Apply transformations to numeric features and the label to improve the distribution properties.
  • Locate and treat duplicate cases.

An example

As a first example you will prepare the automotive dataset. Careful preparation of this dataset, or any dataset, is required before attempting to train any machine learning model. This dataset has a number of problems which must be addressed. Further, some feature engineering will be applied.

Load the dataset

As a first step you must load the dataset.

Execute the code in the cell below to load the packages required to run this notebook.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import numpy.random as nr
import math

%matplotlib inline

Execute the code in the cell below to load the dataset and print the first few rows of the data frame.

In [2]:
auto_prices = pd.read_csv('Automobile price data _Raw_.csv')
auto_prices[:2]
Out[2]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500

2 rows × 26 columns

You will now perform some data preparation steps.

Recall the clean_auto_data() function from the previous lab. The next few sections perform the same tasks as the clean_auto_data() function, with detailed explanations.

Recode names

Notice that several of the column names contain the '-' character. Python will not treat such a name as a single identifier; a name containing '-' is parsed as two names separated by an operator. The same problem occurs with column names or values containing many other special characters, including '-', ',', '*', '/', '|', '>', '<', '@', '!' etc. If such characters appear in column names or values, they must be replaced with another character.

Execute the code in the cell below to replace the '-' characters by '_':

In [3]:
auto_prices.columns = [col.replace('-', '_') for col in auto_prices.columns]
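
The code above fixes only the column names. If the values of a string column also contained problem characters, a similar replacement could be applied with the pandas .str accessor. The cell below is a minimal, standalone sketch using make-like strings; it does not modify the auto_prices data frame.

In [ ]:
# Standalone sketch: replacing '-' in the *values* of a string column.
# The Series here is illustrative only and is not part of auto_prices.
makes = pd.Series(['alfa-romero', 'mercedes-benz', 'audi'])
print(makes.str.replace('-', '_'))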

Treat missing values

Missing values are a common problem in data sets. Failure to deal with missing values before training a machine learning model will lead to biased training at best, and in many cases to outright failure. The Python scikit-learn package will not process arrays with missing values.

There are two problems that must be dealt with when treating missing values:

  1. First you must find the missing values. This can be difficult as there is no standard way missing values are coded. Some common possibilities for missing values are:
    • Coded by some particular character string, or numeric value like -999.
    • A NULL value or numeric missing value such as a NaN.
  2. You must determine how to treat the missing values:
    • Remove features with substantial numbers of missing values. In many cases, such features are likely to have little information value.
    • Remove rows with missing values. If there are only a few rows with missing values it might be easier and more certain to simply remove them.
    • Impute values. Imputation can be done with simple algorithms such as replacing the missing values with the mean or median value. There are also complex statistical methods such as the expectation maximization (EM) or SMOTE algorithms.
    • Use nearest neighbor values. Alternatives for nearest neighbor values include averaging, forward filling or backward filling. A sketch of some of these options appears after this list.
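
The cell below is a minimal, standalone sketch of some of these treatment options using pandas on a small illustrative data frame; it does not use or modify auto_prices.

In [ ]:
# Standalone sketch of common missing value treatments on a toy data frame.
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan, 5.0]})
print(toy.dropna(axis=0))              # remove rows with missing values
print(toy.fillna(toy['x'].median()))   # impute with the median
print(toy.ffill().bfill())             # forward fill, then backward fill leading NaNs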

Carefully observe the first few cases from the data frame and notice that missing values are coded with a '?' character. Execute the code in the cell below to identify the columns with missing values.

In [ ]:
# Exclude two columns, cast the rest to strings, and count the '?' codes in each row
qdata = (auto_prices[list(set(auto_prices.columns) - 
                          set(['normalized_losses','num_of_doors']))].astype(np.object) == '?').sum(axis=1) 
# Drop the normalized_losses column first
auto_prices.drop(['normalized_losses'], axis=1 , inplace=True)
# Drop the rows identified above as containing '?' codes
auto_prices.drop(qdata[qdata > 0].index , inplace=True)
auto_prices.shape
In [ ]:
data = (auto_prices.astype(np.object) == '?').sum().sort_values(ascending=False)
data[data > 0].index.tolist()
In [4]:
(auto_prices.astype(np.object) == '?').any()
Out[4]:
symboling            False
normalized_losses     True
make                 False
fuel_type            False
aspiration           False
num_of_doors          True
body_style           False
drive_wheels         False
engine_location      False
wheel_base           False
length               False
width                False
height               False
curb_weight          False
engine_type          False
num_of_cylinders     False
engine_size          False
fuel_system          False
bore                  True
stroke                True
compression_ratio    False
horsepower            True
peak_rpm              True
city_mpg             False
highway_mpg          False
price                 True
dtype: bool

Execute the code in the cell below to display the data types of each column.

In [5]:
auto_prices.dtypes
Out[5]:
symboling              int64
normalized_losses     object
make                  object
fuel_type             object
aspiration            object
num_of_doors          object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_of_cylinders      object
engine_size            int64
fuel_system           object
bore                  object
stroke                object
compression_ratio    float64
horsepower            object
peak_rpm              object
city_mpg               int64
highway_mpg            int64
price                 object
dtype: object

Compare the columns with missing values to their data types. In all cases, the columns with missing values have an object (character) type as a result of using the '?' code. As a result, some columns that should be numeric (bore, stroke, horsepower, peak_rpm, and price) are coded as object.

The next question is how many missing values are in each of these object type columns? Execute the code in the cell below to display the counts of missing values.

In [6]:
for col in auto_prices.columns:
    if auto_prices[col].dtype == object:
        count = sum(1 for x in auto_prices[col] if x == '?')
        print(col + ' ' + str(count))
normalized_losses 41
make 0
fuel_type 0
aspiration 0
num_of_doors 2
body_style 0
drive_wheels 0
engine_location 0
engine_type 0
num_of_cylinders 0
fuel_system 0
bore 4
stroke 4
horsepower 2
peak_rpm 2
price 4

The normalized_losses column has a significant number of missing values and will be removed. Columns that should be numeric, but contain missing values, are processed in the following manner:

  1. The '?' values are recoded as the Numpy missing value nan.
  2. Rows containing nan values are then removed.

Execute this code, noticing the resulting shape of the data frame.

In [7]:
auto_prices.shape
Out[7]:
(205, 26)
In [8]:
## Drop column with too many missing values
auto_prices.drop('normalized_losses', axis = 1, inplace = True)
## Remove rows with missing values, accounting for missing values coded as '?'
cols = ['price', 'bore', 'stroke',
          'horsepower', 'peak_rpm']
for column in cols:
    auto_prices.loc[auto_prices[column] == '?', column] = np.nan
auto_prices.dropna(axis = 0, inplace = True)
auto_prices.shape    
Out[8]:
(195, 25)

The data set now contains 195 cases and 25 columns. 10 rows have been dropped by removing missing values.

Transform column data type

As has been previously noted, there are five columns in this dataset which do not have the correct type as a result of missing values. This is a common situation, as the methods used to automatically determine data type when loading files can fail when missing values are present.

The code in the cell below iterates over a list of columns setting them to numeric. Execute this code and observe the resulting types.

In [9]:
for column in cols:
    auto_prices[column] = pd.to_numeric(auto_prices[column])
auto_prices[cols].dtypes
Out[9]:
price           int64
bore          float64
stroke        float64
horsepower      int64
peak_rpm        int64
dtype: object
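
As a side note, pd.to_numeric can also recode non-numeric codes such as '?' directly to nan with errors='coerce', which combines the recoding and conversion steps. The cell below is a standalone sketch on a small Series and does not change auto_prices.

In [ ]:
# Standalone sketch: errors='coerce' maps non-numeric codes such as '?' to nan.
print(pd.to_numeric(pd.Series(['13495', '?', '16500']), errors='coerce'))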

Feature engineering and transforming variables

In most cases, machine learning is not done with the raw features. Features are transformed, or combined to form new features, in forms which are more predictive. This process is known as feature engineering. In many cases, good feature engineering is more important than the details of the machine learning model used. Good features can make even poor machine learning models work well, whereas, given poor features, even the best machine learning model will produce poor results. As the famous saying goes, "garbage in, garbage out".

Some common approaches to feature engineering include:

  • Aggregating categories of categorical variables to reduce the number. Categorical features or labels with too many unique categories will limit the predictive power of a machine learning model. Aggregating categories can improve this situation, sometimes greatly. However, one must be careful: it only makes sense to aggregate categories that are similar in the domain of the problem. Thus, domain expertise must be applied.
  • Transforming numeric variables to improve their distribution properties and make their relationships with other variables more linear. This process can be applied not only to features, but also to labels for regression problems. Some common transformations are logarithmic and power transformations, including squares and square roots.
  • Computing new features from two or more existing features. These new features are often referred to as interaction terms. An interaction occurs when the behavior of, say, the product of the values of two features is significantly more predictive than the two features by themselves. Consider the probability of purchase for a luxury men's shoe. This probability depends on the interaction of the buyer being a man and the buyer being wealthy. As another example, consider the number of expected riders on a bus route. This value will depend on the interaction between the time of day and whether it is a holiday. A minimal sketch of constructing an interaction term appears after this list.
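
The cell below is a minimal, standalone sketch of constructing an interaction term as the product of two numeric features on a toy data frame. The values and the derived column name are illustrative only and are not used later in this lab.

In [ ]:
# Standalone sketch of an interaction term: the product of two numeric features.
toy = pd.DataFrame({'curb_weight': [2548, 2823, 1488],
                    'horsepower': [111, 154, 48]})
toy['weight_x_horsepower'] = toy['curb_weight'] * toy['horsepower']
print(toy)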

Aggregating categorical variables

When a dataset contains categorical variables, these need to be investigated to ensure that each category has sufficient samples. It is common for some categories to have very few samples, or for there to be so many similar categories that the distinctions between them are meaningless.

As a specific case, you will examine the number of cylinders in the cars. Execute the code in the cell below to print a frequency table for this variable and examine the result.

In [10]:
auto_prices['num_of_cylinders'].value_counts()
Out[10]:
four      155
six        24
five       10
eight       4
twelve      1
three       1
Name: num_of_cylinders, dtype: int64

Notice that there is only one car each with three and with twelve cylinders. There are only four cars with eight cylinders, and 10 cars with five cylinders. These small categories are unlikely to show statistically significant differences in predicting auto price, so it is clear that they need to be aggregated.

The code in the cell below uses a Python dictionary to recode the number of cylinders into a smaller number of categories. Execute this code and examine the resulting frequency table.

In [11]:
cylinder_categories = {'three':'three_four'  , 'four':'three_four', 
                       'five':'five_six'     , 'six':'five_six',
                       'eight':'eight_twelve', 'twelve':'eight_twelve'}
auto_prices['num_of_cylinders'] = [cylinder_categories[x] for x in auto_prices['num_of_cylinders']]
auto_prices['num_of_cylinders'].value_counts()
Out[11]:
three_four      156
five_six         34
eight_twelve      5
Name: num_of_cylinders, dtype: int64

There are now three categories. One of these categories has only five members. However, it is likely that these autos will have pricing distinct from the others.

Next, execute the code in the cell below to make box plots of the new cylinder categories.

In [12]:
def plot_box(auto_prices, col, col_y = 'price'):
    sns.set_style("whitegrid")
    sns.boxplot(col, col_y, data=auto_prices)
    plt.xlabel(col) # Set text for the x axis
    plt.ylabel(col_y)# Set text for y axis
    plt.show()
    
plot_box(auto_prices, 'num_of_cylinders')    

Indeed, the price range of these categories is distinctive. It is likely that these new categories will be useful in predicting the price of autos.

Now, execute the code in the cell below and examine the frequency table for the body_style feature.

In [13]:
auto_prices['body_style'].value_counts()
Out[13]:
sedan          94
hatchback      63
wagon          24
hardtop         8
convertible     6
Name: body_style, dtype: int64

Two of these categories have a limited number of cases. These categories can be aggregated to increase the number of cases using a similar approach as used for the number of cylinders. Execute the code in the cell below to aggregate these categories.

In [14]:
body_cats = {'sedan':'sedan', 'hatchback':'hatchback', 'wagon':'wagon', 
             'hardtop':'hardtop_convert', 'convertible':'hardtop_convert'}
auto_prices['body_style'] = [body_cats[x] for x in auto_prices['body_style']]
auto_prices['body_style'].value_counts()
Out[14]:
sedan              94
hatchback          63
wagon              24
hardtop_convert    14
Name: body_style, dtype: int64

To investigate if this aggregation of categories was a good idea, execute the code in the cell below to display a box plot.

Then, answer Question 1 on the course page.

In [15]:
def plot_box(auto_prices, col, col_y = 'price'):
    sns.set_style("whitegrid")
    sns.boxplot(col, col_y, data=auto_prices)
    plt.xlabel(col) # Set text for the x axis
    plt.ylabel(col_y)# Set text for y axis
    plt.show()
    
plot_box(auto_prices, 'body_style')    

The hardtop_convert category does appear to have values distinct from the other body styles.

Transforming numeric variables

To improve the performance of machine learning models, transformations of the values are often applied. Typically, transformations are used to make the relationships between variables more linear. In other cases, transformations are performed to make distributions closer to Normal, or at least more symmetric. Common transformations include logarithms, exponentials and powers.

In this case, you will transform the label, the price of the car. Execute the code in the cell below to display and examine a histogram of the label.

In [16]:
def hist_plot(vals, lab):
    ## Distribution plot of values
    sns.distplot(vals)
    plt.title('Histogram of ' + lab)
    plt.xlabel('Value')
    plt.ylabel('Density')
    
#labels = np.array(auto_prices['price'])
hist_plot(auto_prices['price'], 'prices')

The distribution of auto price is both quite skewed to the right and multimodal. Given the skew and the fact that there are no values less than or equal to zero, a log transformation might be appropriate.

The code in the cell below displays a histogram of the logarithm of prices. Execute this code and examine the result.

In [17]:
auto_prices['log_price'] = np.log(auto_prices['price'])
hist_plot(auto_prices['log_price'], 'log prices')

The distribution of the logarithm of price is more symmetric, but still shows some multimodal tendency and skew. Nonetheless, this is an improvement so we will use these values as our label.

The next question is, how does this transformation change the relationship between the label and some of the features? To find out, execute the code in the cell below.

In [19]:
def plot_scatter_shape(auto_prices, cols, shape_col = 'fuel_type', col_y = 'price', alpha = 0.2):
    shapes = ['+', 'o', 's', 'x', '^'] # pick distinctive shapes
    unique_cats = auto_prices[shape_col].unique()
    for col in cols: # loop over the columns to plot
        sns.set_style("whitegrid")
        for i, cat in enumerate(unique_cats): # loop over the unique categories
            temp = auto_prices[auto_prices[shape_col] == cat]
            sns.regplot(col, col_y, data=temp, marker = shapes[i], label = cat,
                        scatter_kws={"alpha":alpha}, fit_reg = False, color = 'blue')
        plt.title('Scatter plot of ' + col_y + ' vs. ' + col) # Give the plot a main title
        plt.xlabel(col) # Set text for the x axis
        plt.ylabel(col_y)# Set text for y axis
        plt.legend()
        plt.show()
            
num_cols = ['curb_weight', 'engine_size', 'horsepower', 'city_mpg']
plot_scatter_shape(auto_prices, num_cols) 
In [18]:
def plot_scatter_shape(auto_prices, cols, shape_col = 'fuel_type', col_y = 'log_price', alpha = 0.2):
    shapes = ['+', 'o', 's', 'x', '^'] # pick distinctive shapes
    unique_cats = auto_prices[shape_col].unique()
    for col in cols: # loop over the columns to plot
        sns.set_style("whitegrid")
        for i, cat in enumerate(unique_cats): # loop over the unique categories
            temp = auto_prices[auto_prices[shape_col] == cat]
            sns.regplot(col, col_y, data=temp, marker = shapes[i], label = cat,
                        scatter_kws={"alpha":alpha}, fit_reg = False, color = 'blue')
        plt.title('Scatter plot of ' + col_y + ' vs. ' + col) # Give the plot a main title
        plt.xlabel(col) # Set text for the x axis
        plt.ylabel(col_y)# Set text for y axis
        plt.legend()
        plt.show()
            
num_cols = ['curb_weight', 'engine_size', 'horsepower', 'city_mpg']
plot_scatter_shape(auto_prices, num_cols)   

Comparing these results to those obtained in the visualization lab, the relationships of curb_weight and city_mpg with log_price do appear more linear than their relationships with the untransformed price.

The relationship between log_price and the categorical variables should likely also be investigated. It is also possible that some type of power transformation should be applied to, say, horsepower or engine_size. In the interest of brevity, these ideas are not pursued here.

Before proceeding, answer Question 2 on the course page.

In [ ]:
# Let's save the dataframe to a csv file 
# We will use this in the next module so that we don't have to re-do the steps above
# You don't have to run this code as the csv file has been saved under the next module's folder
#auto_prices.to_csv('Auto_Data_Preped.csv', index = False, header = True)

Another example

Next, you will prepare the German credit data. Execute the code in the cell below to load the dataset and print the head (first 5 rows) of the dataframe.

In [20]:
credit = pd.read_csv('German_Credit.csv', header=None)
credit.head()
Out[20]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 1122334 A11 6 A34 A43 1169 A65 A75 4 A93 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 6156361 A12 48 A32 A43 5951 A61 A73 2 A92 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 2051359 A14 12 A34 A46 2096 A61 A74 2 A93 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 8740590 A11 42 A32 A42 7882 A61 A74 2 A93 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 3924540 A11 24 A33 A40 4870 A61 A73 3 A93 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 22 columns

This dataset is a bit hard to understand. For a start, the column names are not human readable.

Recode character strings

You have likely noticed that the column names are not human readable. This can be changed, as was done for the previous dataset. Execute the code in the cell below to assign human-readable column names to the data frame.

In [21]:
credit.columns = ['customer_id', 'checking_account_status', 'loan_duration_mo', 'credit_history', 
                  'purpose', 'loan_amount', 'savings_account_balance', 
                  'time_employed_yrs', 'payment_pcnt_income','gender_status', 
                  'other_signators', 'time_in_residence', 'property', 'age_yrs',
                  'other_credit_outstanding', 'home_ownership', 'number_loans', 
                  'job_category', 'dependents', 'telephone', 'foreign_worker', 
                  'bad_credit']
credit.head()
Out[21]:
customer_id checking_account_status loan_duration_mo credit_history purpose loan_amount savings_account_balance time_employed_yrs payment_pcnt_income gender_status ... property age_yrs other_credit_outstanding home_ownership number_loans job_category dependents telephone foreign_worker bad_credit
0 1122334 A11 6 A34 A43 1169 A65 A75 4 A93 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 6156361 A12 48 A32 A43 5951 A61 A73 2 A92 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 2051359 A14 12 A34 A46 2096 A61 A74 2 A93 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 8740590 A11 42 A32 A42 7882 A61 A74 2 A93 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 3924540 A11 24 A33 A40 4870 A61 A73 3 A93 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 22 columns

Next, there is a trickier problem to deal with. The current coding of the categorical variables is opaque, which makes interpreting these variables nearly impossible.

The code in the cell below uses a list of dictionaries to recode the categorical features with human-readable text. The final dictionary in the list recodes good and bad credit as a binary variable, $\{ 0,1 \}$. Two iterators are used to apply the dictionary:

  1. The for loop iterates over the columns and indexes the dictionary for the column.
  2. A list comprehension iterates over the values in the column and uses the dictionary to map the codes to human-readable category names.

Execute this code and examine the result:

In [22]:
code_list = [['checking_account_status', 
              {'A11' : '< 0 DM', 
               'A12' : '0 - 200 DM', 
               'A13' : '> 200 DM or salary assignment', 
               'A14' : 'none'}],
            ['credit_history',
            {'A30' : 'no credit - paid', 
             'A31' : 'all loans at bank paid', 
             'A32' : 'current loans paid', 
             'A33' : 'past payment delays', 
             'A34' : 'critical account - other non-bank loans'}],
            ['purpose',
            {'A40' : 'car (new)', 
             'A41' : 'car (used)',
             'A42' : 'furniture/equipment',
             'A43' : 'radio/television', 
             'A44' : 'domestic appliances', 
             'A45' : 'repairs', 
             'A46' : 'education', 
             'A47' : 'vacation',
             'A48' : 'retraining',
             'A49' : 'business', 
             'A410' : 'other' }],
            ['savings_account_balance',
            {'A61' : '< 100 DM', 
             'A62' : '100 - 500 DM', 
             'A63' : '500 - 1000 DM', 
             'A64' : '>= 1000 DM',
             'A65' : 'unknown/none' }],
            ['time_employed_yrs',
            {'A71' : 'unemployed',
             'A72' : '< 1 year', 
             'A73' : '1 - 4 years', 
             'A74' : '4 - 7 years', 
             'A75' : '>= 7 years'}],
            ['gender_status',
            {'A91' : 'male-divorced/separated', 
             'A92' : 'female-divorced/separated/married',
             'A93' : 'male-single', 
             'A94' : 'male-married/widowed', 
             'A95' : 'female-single'}],
            ['other_signators',
            {'A101' : 'none', 
             'A102' : 'co-applicant', 
             'A103' : 'guarantor'}],
            ['property',
            {'A121' : 'real estate',
             'A122' : 'building society savings/life insurance', 
             'A123' : 'car or other',
             'A124' : 'unknown-none' }],
            ['other_credit_outstanding',
            {'A141' : 'bank', 
             'A142' : 'stores', 
             'A143' : 'none'}],
             ['home_ownership',
            {'A151' : 'rent', 
             'A152' : 'own', 
             'A153' : 'for free'}],
            ['job_category',
            {'A171' : 'unemployed-unskilled-non-resident', 
             'A172' : 'unskilled-resident', 
             'A173' : 'skilled',
             'A174' : 'highly skilled'}],
            ['telephone', 
            {'A191' : 'none', 
             'A192' : 'yes'}],
            ['foreign_worker',
            {'A201' : 'yes', 
             'A202' : 'no'}],
            ['bad_credit',
            {2 : 1,
             1 : 0}]]

for col_dic in code_list:
    col = col_dic[0]
    dic = col_dic[1]
    credit[col] = [dic[x] for x in credit[col]]
    
credit.head()    
Out[22]:
customer_id checking_account_status loan_duration_mo credit_history purpose loan_amount savings_account_balance time_employed_yrs payment_pcnt_income gender_status ... property age_yrs other_credit_outstanding home_ownership number_loans job_category dependents telephone foreign_worker bad_credit
0 1122334 < 0 DM 6 critical account - other non-bank loans radio/television 1169 unknown/none >= 7 years 4 male-single ... real estate 67 none own 2 skilled 1 yes yes 0
1 6156361 0 - 200 DM 48 current loans paid radio/television 5951 < 100 DM 1 - 4 years 2 female-divorced/separated/married ... real estate 22 none own 1 skilled 1 none yes 1
2 2051359 none 12 critical account - other non-bank loans education 2096 < 100 DM 4 - 7 years 2 male-single ... real estate 49 none own 1 unskilled-resident 2 none yes 0
3 8740590 < 0 DM 42 current loans paid furniture/equipment 7882 < 100 DM 4 - 7 years 2 male-single ... building society savings/life insurance 45 none for free 1 skilled 2 none yes 0
4 3924540 < 0 DM 24 past payment delays car (new) 4870 < 100 DM 1 - 4 years 3 male-single ... unknown-none 53 none for free 2 skilled 2 none yes 1

5 rows × 22 columns

The categorical values are now coded in a human readable manner.

Remove duplicate rows

Duplicate cases can seriously bias the training of machine learning models. In simple terms, duplicated cases give undue weight to those cases when training a machine learning model. Therefore, it is necessary to ensure there are no duplicates in the dataset before training a model.

One must be careful when determining whether a case is a duplicate. It is possible for distinct cases to have identical values, particularly if most or all features are categorical. On the other hand, if there are columns with values guaranteed to be unique, these can be used to detect and remove duplicates.

Another consideration when removing duplicate cases is determining which case to remove. If the duplicates have different dates of creation, the newest is often kept. In the absence of such a criterion, the choice is often arbitrary: you may choose to keep the first case or the last case.
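
The cell below is a small, standalone illustration of how the pandas duplicated and drop_duplicates methods expose this keep='first' / keep='last' choice. It uses a toy data frame and does not touch the credit data, which is handled below.

In [ ]:
# Standalone illustration of the keep= choice when dropping duplicates.
ids = pd.DataFrame({'customer_id': [101, 102, 102, 103]})
print(ids[ids.duplicated('customer_id', keep=False)])    # show all duplicated rows
print(ids.drop_duplicates('customer_id', keep='first'))  # keep the first occurrence
print(ids.drop_duplicates('customer_id', keep='last'))   # keep the last occurrence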

The German credit data has a customer_id column which should be unique. In the previous lab, we simply removed customer_id. It turns out that this identifier column is useful for detecting duplicate rows. The presence of duplicates can be determined by comparing the number of rows to the number of unique values of the identifier column, in this case customer_id. The code in the cell below prints the shape of the data frame and the number of unique customer_id values.

Execute this code, examine the results, and answer Question 3 on the course page.

In [25]:
np.sum(credit.customer_id.value_counts() >1)
Out[25]:
12
In [26]:
print(credit.shape)
print(credit.customer_id.unique().shape)
(1012, 22)
(1000,)

There are 12 duplicate cases. These need to be located and the duplicates removed. In this case, the first instance will be kept.

The code in the cell below removes these duplicates from the data frame in place, and the number of remaining rows and unique customer_ids is printed. Execute this code and examine the results.

In [ ]:
credit.drop_duplicates(subset = 'customer_id', keep = 'first', inplace = True)
print(credit.shape)
print(credit.customer_id.unique().shape)

The duplicate rows have been successfully removed.

In [ ]:
# Let's save the dataframe to a csv file 
# We will use this in the next module so that we don't have to re-do the steps above
# You don't have to run this code as the csv file has been saved under the next module's folder
#credit.to_csv('German_Credit_Preped.csv', index = False, header = True)

Feature engineering

Some feature engineering needs to be investigated to determine whether any improvement in predictive power can be expected. From the previous data exploration, it is apparent that several of the numeric features have a strong right skew. A log transformation may help in a case like this.

The code in the cell below uses the Pandas applymap method to apply the log function to each element of several columns of the data frame. Execute this code.

In [ ]:
credit[['log_loan_duration_mo', 'log_loan_amount', 'log_age_yrs']] = credit[['loan_duration_mo', 'loan_amount', 'age_yrs']].applymap(math.log)

Next, execute the code in the cell below to visualize the differences in the distributions of the untransformed and transformed variables for the two label values.

In [ ]:
num_cols = ['log_loan_duration_mo', 'log_loan_amount', 'log_age_yrs',
                   'loan_duration_mo', 'loan_amount', 'age_yrs']

for col in num_cols:
    print(col)
    _ = plt.figure(figsize = (10,4))
    sns.violinplot(x= 'bad_credit', y = col, hue = 'bad_credit', 
                   data = credit)
    plt.ylabel('value')
    plt.xlabel(col)
    plt.show()

The log transformed features have more symmetric distributions. However, it does not appear that the separation of the label cases is improved. These features will not be used further.


Note: Recalling the visualization of the categorical features, there are quite a few categories with few cases. However, it is not clear how these categories can be reasonably combined. It may be the case that some of these categorical features are not terribly predictive.


Summary

Good data preparation is the key to good machine learning performance. Data preparation, or data munging, is a time-consuming, interactive and iterative process. Continue to visualize the results as you test ideas. Expect to try many approaches, reject the ones that do not help, and keep the ones that do. In summary: test a lot of ideas, fail fast, and keep what works. The reward is that well prepared data can improve the performance of almost any machine learning algorithm.

In [ ]: